Technical Section

The  goal  of  this  research  was  to  extract  knowledge  from  text  collections  as  large  and 
diverse  as  the  Web  without  any  human  input.  We  produced  TextRunner,  an  Open 
Information  Extraction  system  that  mines  massive,  heterogeneous  text  corpora  to  extract 
relational tuples without any relation-specific input or training data.

•  We  introduced  simple  syntactic  and  lexical  constraints  on  how  binary  relationships 
are  expressed  via  verbs  in  English  sentences.  The  syntactic  constraint  is  captured  by 
a  compact  regular  expression  over  parts  of  speech,  and  the  lexical  constraint  is 
enforced  by  statistics  computed  over  the  Google  N-gram  corpus. 

•  We implemented the constraints in the OCCAM extractor, which achieves
substantially improved precision/recall over state-of-the-art open extractors such as
TEXTRUNNER and WOE. OCCAM has more than twice the area under the
precision-recall curve compared with TEXTRUNNER, 47% more than WOEpos, and
17% more than WOEparse. OCCAM is also 30x faster than WOEparse on average.

•  We developed a contradiction detection system called AuContraire, which can find
contradictions between facts present in Web text. It applies probabilistic
inference over information about meronymy, functionality of relations, and ambiguity of
entities to distinguish between apparent contradictions and true contradictions.
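
The syntactic constraint in the first item above is described only as a compact regular
expression over parts of speech; the expression itself is not given in this section. The
following is a minimal sketch of the idea, assuming a simplified pattern (one or more verbs,
optionally followed by intervening words and a closing preposition or particle). The tag
mapping, the pattern and the function name are illustrative assumptions, not the OCCAM
implementation, and the lexical constraint is not modeled.

```python
import re

# Hypothetical mapping from Penn Treebank POS tags to one-letter codes (assumption).
# V = verb, W = noun/adjective/adverb/pronoun/determiner, P = preposition/particle/"to".
TAG_CODES = {
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
    "NN": "W", "NNS": "W", "NNP": "W", "NNPS": "W",
    "JJ": "W", "RB": "W", "PRP": "W", "DT": "W",
    "IN": "P", "RP": "P", "TO": "P",
}

# Assumed pattern: verbs, optionally followed by any number of W words and one closing P.
RELATION_PATTERN = re.compile(r"V+(?:W*P)?")


def relation_phrases(tagged):
    """Return candidate relation phrases from a list of (word, tag) pairs."""
    # Encode the sentence as one character per token; unknown tags become "O".
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    phrases = []
    for match in RELATION_PATTERN.finditer(codes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


if __name__ == "__main__":
    sentence = [("Obama", "NNP"), ("made", "VBD"), ("a", "DT"),
                ("deal", "NN"), ("with", "IN"), ("Congress", "NNP")]
    print(relation_phrases(sentence))
```

Encoding each token as a single character lets an ordinary string regular expression stand in
for a pattern over tag sequences, so match offsets map directly back to token positions.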

We developed Grounder, an entity resolution system, which maps surface forms of
named entities into known entities in the Wikipedia taxonomy. The new approach is
based on probabilistic techniques that combine evidence from the prior popularity of
entities as well as the similarity between the Wikipedia page and the current webpage.

We developed a fact re-ranker for TextRunner that, given a set of facts for a query,
computes the order in which they should be presented to the user. The system is
based on classifiers that decide, for example, whether a fact is basic to the query or
whether it is surprising and unexpected. Basic facts and surprising facts are ranked
higher. These classifiers generalize from a limited set of training data. Additionally,
our system is able to personalize the search results based on data provided by the user
in past queries.

We developed a novel hypernym extractor that combines lexico-syntactic patterns
with probabilistic techniques such as Hidden Markov Models to infer whether an
entity pair is in a hypernym-hyponym relationship.

We are currently investigating how to increase the precision and recall of TextRunner by
incorporating additional linguistic resources and information. We are implementing a
separate classifier for each linguistic construct, such as appositives and relative
clauses. We hope that a more precise and comprehensive TextRunner system will
greatly benefit all research that uses TextRunner extractions as input.

We  are  currently  building  a  comprehensive  repository  of  selectional  preferences  for 
arguments  of  each  predicate  based  on  statistical  analysis  over  TextRunner  extractions. 
Our  preliminary  results  are  very  promising  and  we  expect  to  release  the  data  for 
thousands  of  predicates  in  the  near  future. 

We are investigating a next-generation information extractor that will automatically
build  an  expectation  for  possible  future  extractions  as  it  reads  text.  For  example, 
based  on  the  current  extraction  it  may  add  a  template  extraction  in  the  database  for  all 
objects  of  the  same  type.  This  repository  of  templates  will  help  guide  the  later 
extractions  as  more  complex  text  is  read. 

AuContraire discovered a surprising characterization of contradictions on the Web: in a
set of TextRunner extractions from 117 million Web pages, only 1.2% of the seeming
contradictions (extractions of a functional relation whose argument values disagree)
are actual contradictions. The false contradictions have argument
values that are compatible due to synonymy or meronymy (e.g. 'Vienna' does not
contradict 'Austria'). Ambiguous argument values that refer to different real-world
entities also produce false contradictions. Despite the badly skewed data,
AuContraire found true contradictions with precision 1.0 at recall 0.15 and with
precision 0.48 at recall 0.29.
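
The distinction between seeming and actual contradictions can be illustrated with a small
sketch. It assumes that every input tuple belongs to a functional relation and replaces the
probabilistic inference described above with hard lookups in two tiny, hypothetical
knowledge tables; all names and data are illustrative only.

```python
from collections import defaultdict

# Hypothetical background knowledge (assumption, illustrative only).
PART_OF = {("Vienna", "Austria"), ("Salzburg", "Austria")}
SYNONYMS = {frozenset({"USA", "United States"})}


def compatible(a, b):
    """True if both values can hold at once: equal, synonyms, or part/whole."""
    return (
        a == b
        or frozenset({a, b}) in SYNONYMS
        or (a, b) in PART_OF
        or (b, a) in PART_OF
    )


def seeming_contradictions(tuples):
    """Pairs of differing arg2 values for the same (arg1, functional relation)."""
    values_by_key = defaultdict(set)
    for arg1, rel, arg2 in tuples:
        values_by_key[(arg1, rel)].add(arg2)
    pairs = []
    for (arg1, rel), values in values_by_key.items():
        ordered = sorted(values)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.append((arg1, rel, a, b))
    return pairs


def candidate_contradictions(tuples):
    """Seeming contradictions whose values are not known to be compatible."""
    return [p for p in seeming_contradictions(tuples) if not compatible(p[2], p[3])]


if __name__ == "__main__":
    extractions = [
        ("Mozart", "was born in", "Vienna"),
        ("Mozart", "was born in", "Austria"),
        ("Mozart", "was born in", "Salzburg"),
    ]
    print(candidate_contradictions(extractions))
```

In this toy input, three value pairs disagree on the surface, but only the Salzburg/Vienna
pair survives the compatibility check. The sketch does not address ambiguous arguments that
refer to different real-world entities.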


AuContraire  learned  functionality  of  predicates  and  ambiguity  of  arguments  in 
alternating  EM-like  iterations.  It  achieved  precision  0.67  at  recall  0.55  for 
functionality  and  precision  0.87  at  recall  0.34  for  ambiguity. 

Grounder  demonstrated  the  importance  of  prior  probabilities  in  mapping  references  in 
context  to  Wikipedia  articles.  Cosine  similarity  between  a  document  and  the 
Wikipedia  article  gave  precision  0.67  at  recall  0.27,  while  a  prior  that  ignores  context 
gave precision 1.0 at recall 0.31. Combining both sources of knowledge gave results
superior  to  either  alone,  achieving  precision  0.91  at  recall  0.62. 
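
How the two sources of evidence are combined is not specified in this section. The sketch
below shows one simple possibility, assuming a product of the prior and a smoothed
bag-of-words cosine similarity; the scoring rule, the smoothing constant and the candidate
data are assumptions made for illustration, not Grounder's actual model.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def ground(context, candidates, smoothing=0.01):
    """Return the candidate maximizing prior * (similarity + smoothing).

    `candidates` maps an entity name to (prior, article_text). The product
    rule and the smoothing constant are assumptions of this sketch.
    """
    def score(name):
        prior, article = candidates[name]
        return prior * (cosine(context, article) + smoothing)

    return max(candidates, key=score)


if __name__ == "__main__":
    # Hypothetical candidates with made-up priors and article snippets.
    candidates = {
        "Jaguar (animal)": (0.3, "big cat species found in the americas jungle"),
        "Jaguar Cars": (0.7, "british luxury car manufacturer vehicles"),
    }
    print(ground("the jaguar is a big cat that hunts in the jungle", candidates))
    print(ground("new model announced", candidates))
```

The smoothing term keeps the prior in play when the context shares no words with any
candidate article, so the second call falls back to the more popular entity.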

Our HypernymFinder found at least one correct hypernym for proper nouns with
precision 0.90 at recall 0.32 (as compared with WordNet, which covered only 17% of
the proper nouns in our test set). For common nouns HypernymFinder had precision
0.90 at recall 0.67. An HMM-based classifier handles instances not covered by
lexico-syntactic patterns, increasing recall by 0.06 for proper nouns and by 0.02 for
common nouns.
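
The lexico-syntactic patterns themselves are not listed in this section. As an illustration
of what such a pattern looks like, the sketch below implements a single well-known pattern
of this family ("X such as A, B and C") as a regular expression over raw text. The pattern,
the restriction to capitalized single-word hyponyms, and the function name are assumptions;
the HMM-based classifier is not modeled.

```python
import re

# Assumed pattern: a hypernym word, "such as", then a list of capitalized names.
SUCH_AS = re.compile(
    r"(\w+) such as ([A-Z]\w*(?:(?:, and | and |, )[A-Z]\w*)*)"
)


def hypernym_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        # Split the matched list on its separators to recover each hyponym.
        for hyponym in re.split(r", and | and |, ", match.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs


if __name__ == "__main__":
    print(hypernym_pairs("He visited cities such as Paris, Berlin and Vienna last year."))
```

A pattern like this is precise but misses many instances, which is consistent with the
recall gains reported above for the classifier that handles uncovered cases.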

Our fact re-ranker evaluated several definitions of the interestingness of a fact. We found
that three kinds of facts can be interesting: basic facts, specific facts and distinguishing
facts. These definitions often cover different kinds of facts. Our re-ranker was able to
increase the proportion of interesting facts in the first thirty results of a query from
42% to 64%, resulting in a better user experience.

We  reimplemented  TextRunner’s  tuple  extractor  using  a  self-supervised  Conditional 
Random  Field  (CRF).  TextRunner  learns  a  relation-independent  extractor  by 
automatically  generating  positive  and  negative  training  examples  from  parse  trees  and 
a  small  set  of  relation-independent  heuristics.  Where  the  previous  TextRunner  was 
limited to binary tuples of the form (arg1, pred, arg2), the new implementation finds
tuples  with  an  arbitrary  number  of  arguments. 

We evaluated TextRunner’s open extraction model relative to the traditional
extraction paradigm, in which a relation is specified in advance along with
hand-labeled training data per relation.

We built the Holmes system on top of TextRunner. Holmes is able to infer new facts
not seen on any page in the corpus. It does this by combining facts from multiple
Web pages using a small set of rules. Furthermore, we demonstrated that relations
extracted from the Web have the property of being Approximately Pseudo-functional
(most entities appear with only a small number of other entities), and that this property
allows Holmes's inference to scale linearly with the size of the input corpus. For
some example queries, we demonstrated that Holmes doubled recall over the baseline
TextRunner system, and did so in only a few CPU minutes.
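
The idea of combining facts from several pages with a small rule set can be sketched as
simple forward chaining over tuples. This is a minimal illustration, assuming two-premise
rules of the form rel1(x, y) and rel2(y, z) imply new_rel(x, z); the rule format, the example
rule and the facts are hypothetical, and Holmes's actual inference procedure is not
reproduced here.

```python
from collections import defaultdict


def infer(facts, rules, max_rounds=10):
    """Forward-chain rules (rel1, rel2, new_rel) over (arg1, rel, arg2) facts.

    Returns only the facts that were not part of the input.
    """
    known = set(facts)
    for _ in range(max_rounds):
        # Index facts by (relation, first argument) so each join is a lookup.
        index = defaultdict(set)
        for x, rel, y in known:
            index[(rel, x)].add(y)
        new = set()
        for rel1, rel2, new_rel in rules:
            for x, rel, y in known:
                if rel == rel1:
                    for z in index.get((rel2, y), ()):
                        new.add((x, new_rel, z))
        if new <= known:
            break  # fixed point: nothing new was derived in this round
        known |= new
    return known - set(facts)


if __name__ == "__main__":
    # Hypothetical facts, imagined as coming from two different pages.
    facts = {
        ("kale", "contains", "calcium"),
        ("calcium", "prevents", "osteoporosis"),
    }
    rules = [("contains", "prevents", "prevents")]
    print(infer(facts, rules))
```

When most entities co-occur with only a few others, each indexed lookup returns a small set,
which is the intuition behind the linear scaling described above; the sketch itself makes no
such guarantee.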


•  We developed Alice, one of the first learning agents whose goal is to automatically
discover a domain theory (a collection of concepts, facts and generalizations for a
given topic) directly from Web text. Alice uses relational tuples extracted by
TextRunner to learn new concepts and build relationships between concepts in a
hierarchy.

•  We demonstrated that the new implementation of TextRunner can extract a variety of
relations with precision 88.3% and recall 45.2%, while the previous implementation
had precision 86.6% at recall 23.2%. This gives an F1 measure 63.4% higher than
that of the previous implementation.

•  We found that, without any relation-specific input, TextRunner obtains the same
precision, with lower recall, as a traditional supervised extractor trained using
hundreds, and sometimes thousands, of labeled examples per relation.

•  We are currently observing that relational tuples located by TextRunner can be used
to bootstrap the training of extractors for individual relations. TextRunner automatically
provides several orders of magnitude more training data without the cost of hand-tagging,
yielding substantial gains in F1 on a per-relation basis.
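
The 63.4% figure reported above for the new TextRunner implementation is a relative gain in
F1, which can be checked from the stated precision and recall values. The short calculation
below assumes the standard definition of F1 as the harmonic mean of precision and recall.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


new_f1 = f1(0.883, 0.452)   # new implementation
old_f1 = f1(0.866, 0.232)   # previous implementation
relative_gain = (new_f1 - old_f1) / old_f1 * 100

print(f"new F1 = {new_f1:.3f}, old F1 = {old_f1:.3f}, gain = {relative_gain:.1f}%")
```

The two F1 values come out at roughly 0.598 and 0.366, so the improvement is about 23
points in absolute terms and about 63.4% relative to the previous implementation.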

http://www.cs.washington.edu/research/knowitall/

http://ai.cs.washington.edu/projects/open-information-extraction

http://www.cs.washington.edu/research/textrunner/

http://reverb.cs.washington.edu/

http://turingc.cs.washington.edu:1234/lda_sp_demo_v3/lda_sp/relations/

http://abstract.cs.washington.edu/~tlin/leibniz/ 


